Basketball is a game played between two teams of 5-players on a rectangular court indoors or outside. Teams score by having the ball pass through a ring approximately 24cm in diameter and 10feet (3.048m) high (1). The ball can pass through the scoring ring by being placed or thrown (a shot) and may accrue scores of 1, 2 or 3 points depending on circumstances. In summary 3-points are scored for a successful shot when taken beyond the 3-point perimeter line (fig. 1). Two-points are scored for a successful shot during standard play but taken within the 3-point perimeter line. One-point is scored for each ‘free-throw’ that may be taken when the opposition team commits a ‘foul’ as judged by one of two on-court referees under the rules of the game.
Games are usually played over four quarters of approximately 12-minutes (48min total), however this is not standardized across all official competitions. The team with the most accrued points at the end of full-time wins. The pre-eminent competition in the world is the National Basketball Association (NBA) in the USA (2).
Figure 1. Basketball Court (half) showing 3-point perimeter line and player positions
Five players from each team take the court at any given time. In the NBA up to 7-players can be sitting court-side and be substituted in to the game during allocated times in the game. A full team annual roster can comprise a maximum of 15-players. The five players that are on the court at the start of the game (starting players) are generally considered the teams best playing combination.
There are 5 specific playing positions (fig. 1) that align with particular physical, technical or playing role attributes of each player. The Center (C) and Power Forward (PF) are generally the tallest and strongest players. The Small Forward (SF) is slightly smaller and more agile than the PF. The Shooting Guard (SG) and Point Guard (PG) are commonly the smallest and most agile players adept at ball handling and leading the on-court team strategies. The tallest players (C, PF) are often the highest scoring players but also effective in defending or blocking the opposition scoring opportunities. The smaller players (SF, SG, PG) that may not score as highly are often effective in controlling team possession and ‘assisting’ scoring through delivery of the ball to the taller scoring players.
Basketball has many opportunities for statistical analyses. Common foundation metrics collected for analysis include Games played by each player, team winning percentages, team and individual scoring averages. Other metrics may assist in analysing the methods behind scoring and defending such as ‘assists’ where a player has provided the final delivery of the ball to the player that scores as a result; ‘rebounds’ where a player jumps to collect the ball in either attack or defense to take possession of a scoring shot that missed passing through the ring. This metric reflects a teams capacity to gain possession and control of the ball for the next scoring opportunity; ‘blocks’ occur when a player directly blocks a scoring attempt by the opposition in the act of attempting to score. Player value may be evaluated by their effectiveness to score points for their team, but also in their capacity to prevent the opposition form scoring.
A link to common data filed metrics can be found here
The Chicago Bulls (fig 2.) are one of 30 privately owned teams playing in the NBA.
Teams play 82 games per season. Due to the private enterprise structure of the NBA teams do not have equal resources to field a team each season. The Chicago Bulls in season 2018-19 were ranked 24th of 30 for the size of their player contract budget ($112.5 million) (fig 3.)
Figure 3. NBA team payrolls 2018-19.
For the coming NBA season (2019-20) the owners of the Chicago Bulls would like to use data analytics methods to evaluate and find a starting 5 combination that will enhance their potential to win a higher percentage of games in the season. Last year the Chicago Bulls won approximately 27% of their games (fig 4.). The owners have increased slightly the annual payroll for the coming 2019-20 season to $118 million. The challenge to address is to discover a starting 5 combination from across the league, ensuring each playing position is filled, that maximises their opportunity to both score highly and restrict the opposition score, yet remains within the salary cap. Also to consider is that the payroll must ultimately account for a full roster of 15 players. Ideally an improved roster of players will see an increased winning percentage.
Figure 4. NBA team games winning percentages 2018-19.
Data analytics can develop methods to expose players within the NBA system that have playing attributes, as measured by the available data metrics, that are undervalued in their attacking or defensive effectiveness. The analytics methods can devise algorithms that will identify the key metrics associated with scoring or defensive success as well as associated with winning percentage. The Chicago Bulls were ranked 29th of 30 for points scored per 40 minutes played in the 2018-19 season (fig. 5). By identifying the key metrics, developing an effective algorithm and applying it to the full NBA playing cohort, players that show high scoring or defensive effectiveness, yet come at an affordable price, can be identified and approached to play with the Chicago Bulls.
Figure 5. Team scoring rates 2018-19 (points / minute)
Each of the data.files provided is named and read in
df_sal <- read_csv("data_raw/2018-19_nba_player-salaries_raw.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## player_id = col_double(),
## player_name = col_character(),
## salary = col_double()
## )
df_players <- read_csv("data_raw/2018-19_nba_player-statistics_raw.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## player_name = col_character(),
## Pos = col_character(),
## Tm = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
df_team_payroll <- read_csv("data_raw/2019-20_nba_team-payroll_raw.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## team_id = col_double(),
## team = col_character(),
## salary = col_character()
## )
df_team_ptsMP <- read_csv("data_raw/2018-19_nba_team-statistics_2_raw.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Team = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
df_team_WL <- read_csv("data_raw/2018-19_nba_team-statistics_1_raw.csv")
## Warning: Missing column names filled in: 'X23' [23], 'X24' [24], 'X25' [25]
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Team = col_character(),
## X23 = col_logical(),
## X24 = col_logical(),
## X25 = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.
df_team_payroll$COL1 <- as.numeric(gsub('[$,]', '', df_team_payroll$salary))
df_team_payroll <- mutate(df_team_payroll, COL2 = COL1/1000000)
df_team_pts40 <- mutate(df_team_ptsMP, team_pts40 = (PTS / MP)* 40 )
Season Team Payroll
# Barplot team payrolls
team_sal <- ggplot(data = df_team_payroll, aes(x = reorder(team, COL2), y = COL2)) +
geom_bar(stat="identity")
# Horizontal bar plot
new <- team_sal + coord_flip(ylim = c(50, 160))
#label axes
new + labs(title = "Team payroll (2018-19)", face = "italics", x = "Team", y = "Millions")
Win % versus team salary
df_team_WL <- mutate(df_team_WL, winP = (W / (W + L)* 100))
#add win% column
df_team_pts40 <- bind_cols(df_team_pts40, df_team_WL[-c(1:25)])
# team average pts per minute
team_pts40 <- ggplot(data = df_team_pts40, aes(x = reorder(Team, winP), y = winP)) +
geom_bar(stat="identity")
# Horizontal bar plot
new1 <- team_pts40 + coord_flip(ylim = c(20, 75)) +
labs(title = "Team game win percentages (2018-19)", x = "Team", y = "Win %")
new1
Teams scoring rate 2018-19 (score / 40 minutes played)
# team average pts per minute
ggteam_pts40 <- ggplot(data = df_team_pts40, aes(x = reorder(Team, team_pts40), y = team_pts40)) +
geom_bar(stat="identity")
# Horizontal bar plot
new2 <- ggteam_pts40 + coord_flip(ylim = c(17, 20)) +
labs(title = "Team average points scored per 40 minutes played (2018-19)", x = "Team", y = "Points / 40 Minutes")
new2
df_team_pts40 <- df_team_pts40 %>%
mutate(AST40 = (AST / MP)* 40,
TOV40 = (TOV / MP)* 40,
STL40 = (STL / MP)* 40,
BLK40 = (BLK / MP)* 40,
TRB40 = (TRB / MP)* 40,
ORB40 = (ORB / MP)* 40,
DRB40 = (DRB / MP)* 40)
# join df_sal & df_pl_stat by "player_name"
df_players <- left_join(x = df_sal, y = df_players, by = c("player_name"))
merge rows, add column sal.mil
#add new column to player file that include salary
df_players <- mutate(df_players, salmil = salary / 1000000)
Clean processed data file
sum(is.na(df_players))
## [1] 1901
naniar::vis_miss(df_players)
#filter out players with no position or game time
df_pl_clean <- drop_na(df_players, Pos)
naniar::vis_miss(df_pl_clean)
Create normalised data by adding variable “points per 40 minutes played” Explore the data and filter out games less than 10 to minimise outliers
#add standardised scoring (/40 min) to individual players
df_players <- mutate(df_players, pts40 = (PTS / MP)* 40)
#data not yet filtered
df_players %>%
group_by(player_id) %>%
ggplot() +
geom_histogram(mapping = aes(x = pts40), colour = "black", fill = "dodgerblue") +
labs(x = "points per minutes played", y = "number of players", title = "Distribution of points scored per minute played", subtitle = "(games played > 10)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 64 rows containing non-finite values (stat_bin).
#remove outliers created when games are fewer than 10
df_players %>%
group_by(player_id) %>%
filter(G > 10) %>%
ggplot() +
geom_histogram(mapping = aes(x = pts40), colour = "black", fill = "dodgerblue") +
labs(x = "points per minutes played", y = "number of players", title = "Distribution of points scored per minute played", subtitle = "(games played > 10)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation team wins and variables of interest (all variables are standardised to ‘x / 40minutes’)
#points/40min vs win%
df_team_pts40 %>%
ggplot(aes(x = team_pts40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$team_pts40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.9569494
#assists (AST) vs win%
df_team_pts40 %>%
ggplot(aes(x = AST40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$AST40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.548208
#turnovers (TOV) vs win%
df_team_pts40 %>%
ggplot(aes(x = TOV40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$TOV40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.1385349
#steals (STL) vs win%
df_team_pts40 %>%
ggplot(aes(x = STL40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$STL40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.2818981
#blocks (BLK) vs win%
df_team_pts40 %>%
ggplot(aes(x = BLK40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$BLK40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.4495809
#total rebounds (TRB) vs win%
df_team_pts40 %>%
ggplot(aes(x = TRB40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$TRB40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.5195445
#offensive rebounds (ORB) vs win%
df_team_pts40 %>%
ggplot(aes(x = ORB40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$ORB40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.2199931
#defensive rebounds (DRB) vs win%
df_team_pts40 %>%
ggplot(aes(x = DRB40,y = winP)) +
geom_point(colour = "dodgerblue") +
ylim(0, 100) +
geom_smooth(method = "lm", colour = "magenta") +
geom_hline(yintercept = 50, colour = "black", linetype = "dashed")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$DRB40, y = df_team_pts40$winP, method = "pearson")
## [1] 0.4901269
The variables of interest that arose from looking at the association between individual variables and winning percentage were:
Several of the associations are moderate at best, however it was interesting to observe that a defensive attribute (blocks) was associated with winning, potentially due to the restriction of the opposition scoring, and total rebounds that is associated with regaining possession. (I chose to follow ‘total rebounds’ as a variable from here rather than simply defensive rebounds, as the total rebounds had a higher correlation with winning)
With Team points scored per 40 min having a very high correlation with winning percentage, I further explored the associations between team scoring rate and each of the variables previously explored.
Correlations between team scoring rate (pts40) and exploratory variables of interest
#assists vs pts40
df_team_pts40 %>%
ggplot(aes(x = AST40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$AST40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.5823599
#turnovers vs pts40
df_team_pts40 %>%
ggplot(aes(x = TOV40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$TOV40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.06309648
#steals vs pts40
df_team_pts40 %>%
ggplot(aes(x = STL40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$STL40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.1702733
#Bblocks vs pts40
df_team_pts40 %>%
ggplot(aes(x = BLK40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$BLK40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.4192942
#total rebounds vs pts40
df_team_pts40 %>%
ggplot(aes(x = TRB40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$TRB40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.5844159
#defensive rebounds vs pts40
df_team_pts40 %>%
ggplot(aes(x = DRB40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$DRB40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.5767078
#offensive rebounds vs pts40
df_team_pts40 %>%
ggplot(aes(x = ORB40,y = team_pts40)) +
geom_point(colour = "dodgerblue") +
geom_smooth(method = "lm", colour = "magenta")
## `geom_smooth()` using formula 'y ~ x'
cor(x = df_team_pts40$ORB40, y = df_team_pts40$team_pts40, method = "pearson")
## [1] 0.2002259
3 explanatory variables associated with win% and points scored per 40 min are:
#combine rows with same player names
df_players <- bind_rows(df_players) %>%
group_by(player_id, player_name, Pos) %>%
summarise(salmil = mean(salmil, na.rm = TRUE),
G = sum(G, na.rm = TRUE),
GS = sum(GS, na.rm = TRUE),
MP = sum(MP, na.rm = TRUE),
TRB = sum(TRB, na.rm = TRUE),
AST = sum(AST, na.rm = TRUE),
BLK = sum(BLK, na.rm = TRUE),
PTS = sum(PTS, na.rm = TRUE))
## `summarise()` has grouped output by 'player_id', 'player_name'. You can override using the `.groups` argument.
ungroup(df_players)
## # A tibble: 592 x 11
## player_id player_name Pos salmil G GS MP TRB AST BLK PTS
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Alex Abrin… SG 3.67 31 2 588 48 20 6 165
## 2 2 Quincy Acy PF 0.214 10 0 123 25 8 4 17
## 3 3 Steven Ada… C 24.2 80 80 2669 760 124 76 1108
## 4 4 Jaylen Ada… PG 0.237 34 1 428 60 65 5 108
## 5 5 Bam Adebayo C 2.96 82 28 1913 597 184 65 729
## 6 6 Deng Adel SF 0.0772 19 3 194 19 5 4 32
## 7 7 Alexis Aji… <NA> 5.29 0 0 0 0 0 0 0
## 8 8 DeVaughn A… SG 0.0772 7 0 22 4 6 0 7
## 9 9 Cole Aldri… <NA> 2 0 0 0 0 0 0 0
## 10 10 LaMarcus A… C 22.3 81 81 2687 744 194 107 1727
## # … with 582 more rows
df_players40 <- bind_rows(df_players) %>%
group_by(player_id, player_name) %>%
summarise(salmil = mean(salmil, na.rm = TRUE),
G = sum(G, na.rm = TRUE),
GS = sum(GS, na.rm = TRUE),
MP = sum(MP, na.rm = TRUE),
TRB = sum(TRB, na.rm = TRUE),
AST = sum(AST, na.rm = TRUE),
BLK = sum(BLK, na.rm = TRUE),
PTS = sum(PTS, na.rm = TRUE))
## `summarise()` has grouped output by 'player_id'. You can override using the `.groups` argument.
df_players <- df_players %>%
mutate(pts40_ind = (PTS / MP)* 40,
AST40 = (AST / MP)* 40,
BLK40 = (BLK / MP)* 40,
TRB40 = (TRB / MP)* 40,
GMP = (MP / G),
Pts_G = (GMP/40)* pts40_ind)
df_players40 <- df_players40 %>%
mutate(pts40_ind = (PTS / MP)* 40,
AST40 = (AST / MP)* 40,
BLK40 = (BLK / MP)* 40,
TRB40 = (TRB / MP)* 40,
GMP = (MP / G),
Pts_G = (GMP/40)* pts40_ind)
#exclude data from score value zero
df_players <- df_players[!is.na(df_players$pts40_ind), ]
#exclude players who haven't played a game
df_players <- df_players[df_players$G != 0, ]
Look for biases towards Player positions and Explanatory Variables (all rates relate to x per 40 minutes payed)
#player group by pts_ind
df_players %>%
filter(G > 10) %>%
ggplot() +
geom_boxplot(mapping = aes(x = reorder(Pos, pts40_ind, FUN = median),
y = pts40_ind, colour = reorder(Pos, pts40_ind,
FUN = median))) +
labs(x = "Player Positions", y = "Scoring Rate") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45))
#player group by AST
df_players %>%
filter(G > 10) %>%
ggplot() +
geom_boxplot(mapping = aes(x = reorder(Pos, AST40, FUN = median),
y = AST40, colour = reorder(Pos, AST40,
FUN = median))) +
labs(x = "Player Positions", y = "Assists rate") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45))
#player group by BLK
df_players %>%
filter(G > 10) %>%
ggplot() +
geom_boxplot(mapping = aes(x = reorder(Pos, BLK40, FUN = median),
y = BLK40, colour = reorder(Pos, BLK40,
FUN = median))) +
labs(x = "Player Positions", y = "Blocking Rate") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45))
#player group by TRB
df_players %>%
filter(G > 10) %>%
ggplot() +
geom_boxplot(mapping = aes(x = reorder(Pos, TRB40, FUN = median),
y = TRB40, colour = reorder(Pos, TRB40,
FUN = median))) +
labs(x = "Player Positions", y = "Total Rebound Rate") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45))
In light of this interesting information from the bias observations is have decided to establish three models.
One model that looks at the relationship of all selected variables and winning percentage.
One model to assess the salary value of attacking players: scoring rate & assists (PG, SF, SG)
One model to assess the salary value of defending players: scoring rate, blocks & total rebounds (C, PF) Scoring rate is included in both idividual player models as it is a universally important metric.
Build Linear Regression Models #### Team win percentage and all exploratory variables
fit <- lm(winP ~ team_pts40 + AST40 + BLK40 + TRB40, data = df_team_pts40)
tidy(fit, conf.int = TRUE)
## # A tibble: 5 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -324. 23.4 -13.8 3.27e-13 -372. -276.
## 2 team_pts40 21.5 1.68 12.8 1.87e-12 18.0 24.9
## 3 AST40 -2.06 3.17 -0.651 5.21e- 1 -8.58 4.46
## 4 BLK40 10.5 8.25 1.28 2.13e- 1 -6.45 27.5
## 5 TRB40 -2.92 2.85 -1.02 3.15e- 1 -8.78 2.95
summary(fit)
##
## Call:
## lm(formula = winP ~ team_pts40 + AST40 + BLK40 + TRB40, data = df_team_pts40)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.915 -3.480 -0.474 3.043 6.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -323.904 23.431 -13.824 3.27e-13 ***
## team_pts40 21.482 1.682 12.771 1.87e-12 ***
## AST40 -2.060 3.166 -0.651 0.521
## BLK40 10.543 8.250 1.278 0.213
## TRB40 -2.917 2.847 -1.025 0.315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.378 on 25 degrees of freedom
## Multiple R-squared: 0.9232, Adjusted R-squared: 0.9109
## F-statistic: 75.14 on 4 and 25 DF, p-value: 1.46e-13
fit_ind_att <- lm(salmil ~ AST40 + pts40_ind, data = df_players40, subset = G > 10)
tidy(fit_ind_att, conf.int = TRUE)
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.76 1.15 -4.99 8.73e- 7 -8.03 -3.49
## 2 AST40 0.766 0.156 4.91 1.30e- 6 0.459 1.07
## 3 pts40_ind 0.612 0.0648 9.44 2.06e-19 0.484 0.739
summary(fit_ind_att)
##
## Call:
## lm(formula = salmil ~ AST40 + pts40_ind, data = df_players40,
## subset = G > 10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.390 -4.623 -1.816 3.788 22.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.75999 1.15466 -4.988 8.73e-07 ***
## AST40 0.76582 0.15606 4.907 1.30e-06 ***
## pts40_ind 0.61155 0.06478 9.441 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.052 on 447 degrees of freedom
## Multiple R-squared: 0.2436, Adjusted R-squared: 0.2402
## F-statistic: 71.97 on 2 and 447 DF, p-value: < 2.2e-16
fit_ind_def <- lm(salmil ~ BLK40 + TRB40 + pts40_ind, data = df_players40, subset = G > 10)
tidy(fit_ind_def, conf.int = TRUE)
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.34 1.25 -4.26 2.49e- 5 -7.80 -2.87
## 2 BLK40 -0.920 0.545 -1.69 9.22e- 2 -1.99 0.151
## 3 TRB40 0.322 0.124 2.60 9.65e- 3 0.0785 0.565
## 4 pts40_ind 0.658 0.0654 10.1 1.38e-21 0.529 0.786
summary(fit_ind_def)
##
## Call:
## lm(formula = salmil ~ BLK40 + TRB40 + pts40_ind, data = df_players40,
## subset = G > 10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.294 -4.698 -2.066 4.081 26.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.3370 1.2527 -4.260 2.49e-05 ***
## BLK40 -0.9204 0.5453 -1.688 0.09216 .
## TRB40 0.3219 0.1239 2.599 0.00965 **
## pts40_ind 0.6579 0.0654 10.059 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.193 on 446 degrees of freedom
## Multiple R-squared: 0.2148, Adjusted R-squared: 0.2095
## F-statistic: 40.66 on 3 and 446 DF, p-value: < 2.2e-16
Test Assumptions
#team win %
car::durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 0.3818301 1.230993 0.01
## Alternative hypothesis: rho != 0
#attacking variables
car::durbinWatsonTest(fit_ind_att)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04405598 1.909643 0.332
## Alternative hypothesis: rho != 0
#defensive variables
car::durbinWatsonTest(fit_ind_def)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04641895 1.903551 0.316
## Alternative hypothesis: rho != 0
#team win %
car::avPlots(fit)
#attacking variables
car::avPlots(fit_ind_att)
#defensive variables
car::avPlots(fit_ind_def)
#team win %
std_res <- rstandard(fit)
points <- 1:length(std_res)
res_labels <- if_else(abs(std_res) >= 2.5, paste(points), "")
ggplot(data = NULL, aes(x = points, y = std_res)) + geom_point() +
geom_text(aes(label = res_labels), nudge_y = 0.3) + ylim(c(-4,4)) +
geom_hline(yintercept = c(-2.5, 2.5), colour = "red", linetype = "dashed")
#attacking variables
std_res_att <- rstandard(fit_ind_att)
points_att <- 1:length(std_res_att)
res_labels_att <- if_else(abs(std_res_att) >= 3, paste(points_att), "")
ggplot(data = NULL, aes(x = points_att, y = std_res_att)) + geom_point() +
geom_text(aes(label = res_labels_att), nudge_y = 0.3) + ylim(c(-4,4)) +
geom_hline(yintercept = c(-3, 3), colour = "red", linetype = "dashed")
#defensive variables
std_res_def <- rstandard(fit_ind_def)
points_def <- 1:length(std_res_def)
res_labels_def <- if_else(abs(std_res_def) >= 3, paste(points_def), "")
ggplot(data = NULL, aes(x = points_def, y = std_res_def)) + geom_point() +
geom_text(aes(label = res_labels_def), nudge_y = 0.3) + ylim(c(-4.2,4.2)) +
geom_hline(yintercept = c(-3, 3), colour = "red", linetype = "dashed")
5. Influential points
#team win %
cook <- cooks.distance(fit)
cook_labels <- if_else(cook >= 0.2, paste(points), "")
ggplot(data = NULL, aes(x = points, y = cook)) + geom_point() +
geom_text(aes(label = cook_labels), nudge_y = 0.01)
#attacking variables
cook <- cooks.distance(fit_ind_att)
cook_labels <- if_else(cook >= 0.04, paste(points_att), "")
ggplot(data = NULL, aes(x = points_att, y = cook)) + geom_point() +
geom_text(aes(label = cook_labels), nudge_y = 0.01)
#defensive variables
cook <- cooks.distance(fit_ind_def)
cook_labels <- if_else(cook >= 0.04, paste(points_def), "")
ggplot(data = NULL, aes(x = points_def, y = cook)) + geom_point() +
geom_text(aes(label = cook_labels), nudge_y = 0.01)
outliers <- c(23, 28, 29)
filtered_team_df <- df_team_pts40 %>%
filter(!Rk %in% outliers)
fit2 <- fit <- lm(winP ~ team_pts40 + AST40 + BLK40 + TRB40, data = filtered_team_df)
tidy(fit2, conf.int = TRUE)
## # A tibble: 5 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -320. 23.7 -13.5 3.93e-12 -369. -271.
## 2 team_pts40 20.7 1.55 13.4 4.93e-12 17.4 23.9
## 3 AST40 -4.17 3.16 -1.32 2.00e- 1 -10.7 2.38
## 4 BLK40 4.40 9.83 0.447 6.59e- 1 -16.0 24.8
## 5 TRB40 0.425 2.86 0.149 8.83e- 1 -5.50 6.35
summary(fit2)
##
## Call:
## lm(formula = winP ~ team_pts40 + AST40 + BLK40 + TRB40, data = filtered_team_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9843 -2.3349 -0.0071 2.4187 7.3367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -319.789 23.663 -13.514 3.93e-12 ***
## team_pts40 20.656 1.546 13.360 4.93e-12 ***
## AST40 -4.171 3.159 -1.320 0.200
## BLK40 4.396 9.828 0.447 0.659
## TRB40 0.425 2.856 0.149 0.883
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.916 on 22 degrees of freedom
## Multiple R-squared: 0.9251, Adjusted R-squared: 0.9114
## F-statistic: 67.89 on 4 and 22 DF, p-value: 4.68e-12
#team win %
res <- residuals(fit)
fitted <- predict(fit)
ggplot(data = NULL, aes(x = fitted, y = res)) +
geom_point(colour = "dodgerblue") +
geom_smooth(se = FALSE, colour = "magenta")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#attacking variables
res <- residuals(fit_ind_att)
fitted <- predict(fit_ind_att)
ggplot(data = NULL, aes(x = fitted, y = res)) +
geom_point(colour = "dodgerblue") +
geom_smooth(se = FALSE, colour = "magenta")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#defensive variables
res <- residuals(fit_ind_def)
fitted <- predict(fit_ind_def)
ggplot(data = NULL, aes(x = fitted, y = res)) +
geom_point(colour = "dodgerblue") +
geom_smooth(se = FALSE, colour = "magenta")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
- Homoscedasticity is maintained for the team win percentage model, yet is present in both individual player models
ggplot(data = NULL, aes(sample = res)) + stat_qq() + stat_qq_line()
ggplot(data = NULL, aes(x = res)) + geom_histogram(colour = "black", fill = "dodgerblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- The data is slightly skewed to the left and tailing to the right, however having investigated the data i am comfortable with the underlying context
#team
pairs(formula = ~ winP + team_pts40 + AST40 + BLK40 + TRB40, data = df_team_pts40)
sqrt(car::vif(fit))
## team_pts40 AST40 BLK40 TRB40
## 1.264263 1.272138 1.221721 1.219421
#attacking variables
pairs(formula = ~ salmil + pts40_ind + AST40, data = df_players, subset = G > 10)
sqrt(car::vif(fit_ind_att))
## AST40 pts40_ind
## 1.03044 1.03044
#defensive variables
pairs(formula = ~ salmil + pts40_ind + BLK40 + TRB40, data = df_players, subset = G > 10)
sqrt(car::vif(fit_ind_def))
## BLK40 TRB40 pts40_ind
## 1.252689 1.272556 1.020010
SOme data points appear to have an influence on creating non-linear patterns in the data, however i have investigated these outliers previously and am happy to keep them in the data model.
Below are the processed data files that i then wrote to the repos directory.
write.csv(df_players, file = 'data_processed/2018-19_nba_player-salaries-stats_.csv', row.names = FALSE)
write.csv(df_team_pts40, file = 'data_processed/2018-19_nba_teamdata_standardised_.csv', row.names = FALSE)
Model descriptions
(all explanatory variables are standardised to a rate of ‘per 40min’)
\(Y = Bo + B1x1 + B2x2...+ Bnxn\)
\(Team Win Percentage = -324 + (21.42 x Team Points) + (-2.06 x Assists) + (10.54 x Block) + (-2.92 x Total Rebounds)\)
\(Salary estimate = -5.760 + (7.66 x Assists) + (6.11 x Points Scored)\)
\(Salary estimate = -5.337 + (-9.20 x Blocks) + (3.22 x Total Rebounds) + (6.58 x Points Scored)\)
#TRB
df_players %>%
filter(G > 10, Pos %in% c("C", "PF", "PG", "SF", "SG")) %>%
ggplot() +
geom_histogram(aes(x = TRB40, fill = Pos), colour = "black", bins = 40) + facet_wrap(~Pos, nrow = 5) +
labs(x = "Total Rebound Rate", y = "Count")
#BLK
df_players %>%
filter(G > 10, Pos %in% c("C", "PF", "PG", "SF", "SG")) %>%
ggplot() +
geom_histogram(aes(x = BLK40, fill = Pos), colour = "black", bins = 40) + facet_wrap(~Pos, nrow = 5) +
labs(x = "Blocking Rate", y = "Count")
#assist
df_players %>%
filter(G > 10, Pos %in% c("C", "PF", "PG", "SF", "SG")) %>%
ggplot() +
geom_histogram(aes(x = AST40, fill = Pos), colour = "black", bins = 40) + facet_wrap(~Pos, nrow = 5) +
labs(x = "Assist Rate", y = "Count")
#pts40
df_players %>%
filter(G > 10, Pos %in% c("C", "PF", "PG", "SF", "SG")) %>%
ggplot() +
geom_histogram(aes(x = pts40_ind, fill = Pos), colour = "black", bins = 40) + facet_wrap(~Pos, nrow = 5) +
labs(x = "Points Scored Rate", y = "Count")
# create panels by match_outcome
new_dat <- df_team_pts40 %>%
select(Team, winP, team_pts40, AST40, BLK40, TRB40)
fit <- lm(winP ~ AST40 + BLK40 + TRB40 + team_pts40, data = df_team_pts40)
new_dat <- mutate(new_dat, exp_win_perc = predict(fit, newdata = new_dat))
ggplot(new_dat, aes(winP, exp_win_perc, label = Team)) + geom_point(colour = "dodgerblue") +
geom_text_repel(nudge_x = 4, cex = 3) +
labs(x = "Actual Win %", y = "Expected Win % (model)") +
geom_abline(linetype = "dashed", colour = "magenta")
#test correlation between expected win % from the model and actual win %
cor(x = new_dat$winP, y = new_dat$exp_win_perc, method = "pearson")
## [1] 0.9608395
#indiv attack
df_players <- ungroup(df_players)
fit_ind_att <- lm(salmil ~ AST40 + pts40_ind, data = df_players)
df_players <- mutate(df_players, exp_salmil_att = predict(fit_ind_att, newdata = df_players))
df_players %>%
filter(G > 10, !Pos %in% c("C", "PF", "C-PF", "PF-C")) %>%
ggplot () +
geom_point(mapping = aes(x = exp_salmil_att, y = salmil, colour = Pos)) +
labs(x = "Expected Sallary (millions)", y = "Actual Sallary (millions)") +
geom_abline(linetype = "dashed", colour = "magenta")
#indiv defence
fit_ind_def <- lm(salmil ~ BLK40 + TRB40 + pts40_ind, data = df_players)
df_players <- mutate(df_players, exp_salmil_def = predict(fit_ind_def, newdata = df_players))
df_players %>%
filter(G > 10, !Pos %in% c("SG", "SF", "PG", "SF-SG", "SG-SF")) %>%
ggplot () +
geom_point(mapping = aes(x = exp_salmil_def, y = salmil, colour = Pos)) +
labs(x = "Expected Sallary (millions)", y = "Actual Sallary (millions)") +
geom_abline(linetype = "dashed", colour = "magenta")
#attack discrepancy by score productivity
Selection_att <- df_players %>%
filter(G > 10, GS > 10, !Pos %in% c("C", "PF", "C-PF", "PF-C")) %>%
mutate(discrp_att = exp_salmil_att - salmil,
sc_inv = pts40_ind + AST40,
rank_att = discrp_att / (pts40_ind + AST40))
#Defence
Selection_def <- df_players %>%
filter(G > 10, GS > 10, !Pos %in% c("SG", "SF", "PG", "SF-SG", "SG-SF")) %>%
mutate(discrp_def = exp_salmil_def - salmil,
sc_def = pts40_ind + BLK40 + TRB40,
rank_def = discrp_def / sc_def)
This playing group, while starting would average:
21.1 Points per 40 minutes of play: placing 1st in the NBA in scoring rate
9.2 Assists per 40 minutes of play: placing 1st in the NBA in assists
0.3 Blocks per 40 minutes of play: placing 30th in the NBA in block (ideally this means the opposition isn’t taking shots)
5.1 Total rebounds per 40 minutes of play: placing 30th in the NBA in Total Rebounds